Recipe Diet Classification with NLP¶
1. Problem Statement Formulation and Definition¶
Motivation¶
The motivation behind this project stems from an interest in understanding the relationship between a recipe's ingredients and its diet category from a calorie- and protein-intake perspective. An ingredient-based recipe diet categorization can help identify whether a recipe or dish suits a particular diet based on its "calories" and "protein" values.
Problem Statement¶
This project aims to develop an ingredient-based recipe diet classification model that identifies whether a recipe is low calorie, high protein, or other based on its ingredients.
Expected Results¶
The developed NLP model is expected to correctly identify whether a recipe belongs to a low-calorie, high-protein, or other diet category from its ingredients, after properly processing the recipes' diet-category and ingredient data.
2. Selection of an appropriate data set (Data Collection)¶
Data Selection and Justification¶
The dataset used for this project is the "Food.com Recipes and Interactions" dataset from Kaggle (Li, 2019). It contains a rich collection of recipe records extracted from the Food.com recipe website. Moreover, the records are highly relevant to the project objective of identifying a recipe's diet from its ingredients: each record contains the recipe ingredients that the NLP model will mainly depend on, as well as the recipe's nutrition information, including the calories and protein values used to derive the diet category.
Data Visualization and Exploratory Data Analysis¶
Project Imports
# for data
import pandas as pd
import ast
from collections import Counter
import numpy as np
# for visualization
import plotly.express as px
import plotly.io as pio
import plotly.subplots as sp
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# for text processing
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# for encoding categorical labels
from sklearn.preprocessing import LabelEncoder
# for text representation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
# for model development
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# for model evaluation
import scikitplot as skplt
from sklearn.metrics import classification_report
nltk.download('punkt') # for tokenization
nltk.download('stopwords')
nltk.download('wordnet') # for lemmatization
# to ensure plotly graphs are exported
pio.renderers.default = "plotly_mimetype+notebook"
# read the dataset and save it into a pandas dataframe (df)
data = pd.read_csv("data/recipes/RAW_recipes.csv")
# display first 5 rows of the data
data.head()
| name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arriba baked winter squash mexican style | 137739 | 55 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | 11 | ['make a choice and proceed with recipe', 'dep... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... | 7 |
| 1 | a bit different breakfast pizza | 31490 | 30 | 26278 | 2002-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] | 9 | ['preheat oven to 425 degrees f', 'press dough... | this recipe calls for the crust to be prebaked... | ['prepared pizza crust', 'sausage patty', 'egg... | 6 |
| 2 | all in the kitchen chili | 112140 | 130 | 196586 | 2005-02-25 | ['time-to-make', 'course', 'preparation', 'mai... | [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] | 6 | ['brown ground beef in large pot', 'add choppe... | this modified version of 'mom's' chili was a h... | ['ground beef', 'yellow onions', 'diced tomato... | 13 |
| 3 | alouette potatoes | 59389 | 45 | 68585 | 2003-04-14 | ['60-minutes-or-less', 'time-to-make', 'course... | [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] | 11 | ['place potatoes in a large pot of lightly sal... | this is a super easy, great tasting, make ahea... | ['spreadable cheese with garlic and herbs', 'n... | 11 |
| 4 | amish tomato ketchup for canning | 44061 | 190 | 41706 | 2002-10-25 | ['weeknight', 'time-to-make', 'course', 'main-... | [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] | 5 | ['mix all ingredients& boil for 2 1 / 2 hours ... | my dh's amish mother raised him on this recipe... | ['tomato juice', 'apple cider vinegar', 'sugar... | 8 |
Exploratory Data Analysis (EDA)¶
# data summary
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231637 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   name            231636 non-null  object
 1   id              231637 non-null  int64
 2   minutes         231637 non-null  int64
 3   contributor_id  231637 non-null  int64
 4   submitted       231637 non-null  object
 5   tags            231637 non-null  object
 6   nutrition       231637 non-null  object
 7   n_steps         231637 non-null  int64
 8   steps           231637 non-null  object
 9   description     226658 non-null  object
 10  ingredients     231637 non-null  object
 11  n_ingredients   231637 non-null  int64
dtypes: int64(5), object(7)
memory usage: 21.2+ MB
Dataset Summary¶
There are 231637 records and 12 columns in the dataset, with 5 columns containing numerical data and 7 columns containing objects, as the dataframe summary above shows along with the column names.
Statistics for numeric values¶
stats = data[["minutes", "n_steps", "n_ingredients"]].describe()
stats.astype(int)
| minutes | n_steps | n_ingredients | |
|---|---|---|---|
| count | 231637 | 231637 | 231637 |
| mean | 9398 | 9 | 9 |
| std | 4461963 | 5 | 3 |
| min | 0 | 0 | 1 |
| 25% | 20 | 6 | 6 |
| 50% | 40 | 9 | 9 |
| 75% | 65 | 12 | 11 |
| max | 2147483647 | 145 | 43 |
Data Visualization¶
ingredients_fig = px.histogram(
data,
x="n_ingredients",
title="Distribution of Recipe's number of ingredients",
labels={"n_ingredients": "Number of Ingredients", "count": "Count"},
marginal="box",
)
ingredients_fig.update_layout(bargap=0.2)
ingredients_fig.show()
steps_fig = px.histogram(
data,
x="n_steps",
title="Distribution of Recipe's number of preparation steps",
labels={"n_steps": "Number of Preparation Steps", "count": "Count"},
marginal="box",
)
steps_fig.update_layout(bargap=0.2)
steps_fig.show()
time_fig = px.histogram(
data,
x="minutes",
title="Distribution of Recipe's preparation time",
labels={"minutes": "Preparation Time", "count": "Count"},
marginal="box",
)
time_fig.update_layout(bargap=0.2)
time_fig.show()
The visualization is not clear due to the extremely large outlier values.
Based on the box plot, the upper fence of the recipe preparation time is 132 minutes. Therefore, a subset of the preparation-time column is created with an upper limit of 132 to obtain a clearer visualization.
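The quoted upper fence follows Tukey's rule (Q3 + 1.5 × IQR) applied to the quartiles in the statistics table above; the plotted fence of 132 is the largest observed value at or below this bound. A quick check:

```python
# Tukey's rule for the box-plot upper fence, using the preparation-time
# quartiles from the statistics table above (Q1 = 20, Q3 = 65 minutes)
q1, q3 = 20, 65
iqr = q3 - q1                   # interquartile range = 45
upper_fence = q3 + 1.5 * iqr    # 65 + 67.5 = 132.5
print(upper_fence)              # -> 132.5, consistent with the ~132 cutoff
```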
# create a copy of the dataframe with only preparation time less than or equal to 132
df_filtered_minutes = data[data["minutes"] <= 132]
A clearer visualization of the recipe preparation times.
time_fig3 = px.histogram(
df_filtered_minutes,
x="minutes",
title="Distribution of Recipe's preparation time (recipes with preparation time <= 132)",
labels={"minutes": "Preparation Time", "count": "Count"},
marginal="box",
)
time_fig3.update_layout(bargap=0.2)
time_fig3.show()
It is difficult to determine whether all the outliers in the three analyzed numerical columns are correct or erroneous. Moreover, these columns are of little significance for identifying the diet category from the ingredients, which mostly depends on the ingredient names and text. Therefore, these columns can be dropped from the dataset.
Data Preprocessing¶
The first step is dropping the unrelated columns identified during data visualization.
# drop n_steps, n_ingredients, minutes columns
data.drop(
columns=["n_steps", "n_ingredients", "minutes"],
inplace=True,
)
Based on the dataset summary, two columns, name and description, contain null values.
# checking null values
data.isnull().sum()
name                 1
id                   0
contributor_id       0
submitted            0
tags                 0
nutrition            0
steps                0
description       4979
ingredients          0
dtype: int64
- 1 record is missing a name
- 4979 records are missing a description
# checking the values in description column
data["description"].head().values
array(['autumn is my favorite time of year to cook! this recipe \r\ncan be prepared either spicy or sweet, your choice!\r\ntwo of my posted mexican-inspired seasoning mix recipes are offered as suggestions.',
'this recipe calls for the crust to be prebaked a bit before adding ingredients. feel free to change sausage to ham or bacon. this warms well in the microwave for those late risers.',
"this modified version of 'mom's' chili was a hit at our 2004 christmas party. we made an extra large pot to have some left to freeze but it never made it to the freezer. it was a favorite by all. perfect for any cold and rainy day. you won't find this one in a cookbook. it is truly an original.",
'this is a super easy, great tasting, make ahead side dish that looks like you spent a lot more time preparing than you actually do. plus, most everything is done in advance. the times do not reflect the standing time of the potatoes.',
"my dh's amish mother raised him on this recipe. he much prefers it over store-bought ketchup. it was a taste i had to acquire, but now my ds's also prefer this type of ketchup. enjoy!"],
dtype=object)
The description column has a very large number of missing values, and the available values show that it contains free-text descriptions or general information written by the recipe uploader. Therefore, this column is of no relevance and can be dropped from the dataset.
# drop description column
data.drop(
columns=["description"],
inplace=True,
)
# checking the row with missing recipe name
missing_name_idx = data[data["name"].isnull()].index
data.loc[missing_name_idx].values
array([[nan, 368257, 779451, '2009-04-27',
"['15-minutes-or-less', 'time-to-make', 'course', 'preparation', 'low-protein', 'salads', 'easy', 'salad-dressings', 'dietary', 'low-sodium', 'inexpensive', 'low-in-something', '3-steps-or-less']",
'[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]',
"['in a bowl , combine ingredients except for olive oil', 'slowly whisk inches', 'olive oil until thickened', 'great with field greens', 'makes about 2 / 3', 'cup dressing']",
"['lemon', 'honey', 'horseradish mustard', 'garlic clove', 'dried parsley', 'dried basil', 'dried thyme', 'garlic salt', 'black pepper', 'olive oil']"]],
dtype=object)
Regarding the missing value in the name column, the values of the record suggest that it is a salad-dressing recipe. This can be inferred mainly from the tags and ingredients associated with the recipe. Accordingly, the null value can be replaced with "Salad Dressing" as a name.
If multiple rows had missing names, a possible way to handle the null values would be to assign each row a name generated from its recipe tags.
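A hedged sketch of that idea follows; the helper name `name_from_tags` and the set of generic tags to skip are illustrative choices, not part of the notebook:

```python
import ast
import random

# illustrative set of bookkeeping tags that make poor recipe names
GENERIC_TAGS = {"time-to-make", "course", "preparation", "dietary",
                "low-in-something", "3-steps-or-less"}

def name_from_tags(tags_str: str, seed: int = 42) -> str:
    # Parse the stringified tag list, drop generic tags, then pick one
    # remaining tag at random and title-case it as a fallback name.
    tags = [t for t in ast.literal_eval(tags_str) if t not in GENERIC_TAGS]
    if not tags:
        return "Untitled Recipe"
    return random.Random(seed).choice(tags).replace("-", " ").title()

name_from_tags("['course', 'salads', 'salad-dressings', 'dietary']")
```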
# assign the selected name in place of the missing name
data.loc[missing_name_idx, "name"] = "Salad Dressing"
# view the updates on the row
data.loc[missing_name_idx].values
array([['Salad Dressing', 368257, 779451, '2009-04-27',
"['15-minutes-or-less', 'time-to-make', 'course', 'preparation', 'low-protein', 'salads', 'easy', 'salad-dressings', 'dietary', 'low-sodium', 'inexpensive', 'low-in-something', '3-steps-or-less']",
'[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]',
"['in a bowl , combine ingredients except for olive oil', 'slowly whisk inches', 'olive oil until thickened', 'great with field greens', 'makes about 2 / 3', 'cup dressing']",
"['lemon', 'honey', 'horseradish mustard', 'garlic clove', 'dried parsley', 'dried basil', 'dried thyme', 'garlic salt', 'black pepper', 'olive oil']"]],
dtype=object)
Identification of unrelated / irrelevant columns¶
# viewing the first 5 rows of the dataframe to check the values
data.head()
| name | id | contributor_id | submitted | tags | nutrition | steps | ingredients | |
|---|---|---|---|---|---|---|---|---|
| 0 | arriba baked winter squash mexican style | 137739 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | ['make a choice and proceed with recipe', 'dep... | ['winter squash', 'mexican seasoning', 'mixed ... |
| 1 | a bit different breakfast pizza | 31490 | 26278 | 2002-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] | ['preheat oven to 425 degrees f', 'press dough... | ['prepared pizza crust', 'sausage patty', 'egg... |
| 2 | all in the kitchen chili | 112140 | 196586 | 2005-02-25 | ['time-to-make', 'course', 'preparation', 'mai... | [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] | ['brown ground beef in large pot', 'add choppe... | ['ground beef', 'yellow onions', 'diced tomato... |
| 3 | alouette potatoes | 59389 | 68585 | 2003-04-14 | ['60-minutes-or-less', 'time-to-make', 'course... | [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] | ['place potatoes in a large pot of lightly sal... | ['spreadable cheese with garlic and herbs', 'n... |
| 4 | amish tomato ketchup for canning | 44061 | 41706 | 2002-10-25 | ['weeknight', 'time-to-make', 'course', 'main-... | [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] | ['mix all ingredients& boil for 2 1 / 2 hours ... | ['tomato juice', 'apple cider vinegar', 'sugar... |
The columns 'id', 'contributor_id', and 'submitted' are irrelevant to the project's use case and the NLP model development, as they relate to the submission of the recipes on the source website. Hence, these columns can be dropped, in addition to the 'tags', 'name', and 'steps' columns, which are not required for this use case.
# drop id, contributor_id, submitted, n_steps, n_ingredients, tags columns
data.drop(
columns=["id", "contributor_id", "submitted", "tags", "name", "steps"],
inplace=True,
)
Further processing¶
The nutrition column requires preprocessing to infer the required calories information and to assign the proper category to each record. Based on the data card on Kaggle, where the dataset was retrieved from (Li, 2019), this column lists the nutrition values in the following order: calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV). Therefore, the nutrition column can be split into multiple columns, one per value.
# ensure that the nutrition column contains lists of floating values instead of a string
# using ast "Abstract Syntax Trees"
data["nutrition"] = data["nutrition"].apply(ast.literal_eval)
# create the new columns and assign the values to them
data[
[
"calories (#)",
"total fat (PDV)",
"sugar (PDV)",
"sodium (PDV)",
"protein (PDV)",
"saturated fat (PDV)",
"carbohydrates (PDV)",
]
] = data["nutrition"].to_list()
# drop the nutrition column as it's of no more use
data.drop(columns=["nutrition"], inplace=True)
# update ingredients to ensure that they are lists of strings not one string objects
data["ingredients"] = data["ingredients"].apply(ast.literal_eval)
# view updated dataframe
data.head()
| ingredients | calories (#) | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) | |
|---|---|---|---|---|---|---|---|---|
| 0 | [winter squash, mexican seasoning, mixed spice... | 51.5 | 0.0 | 13.0 | 0.0 | 2.0 | 0.0 | 4.0 |
| 1 | [prepared pizza crust, sausage patty, eggs, mi... | 173.4 | 18.0 | 0.0 | 17.0 | 22.0 | 35.0 | 1.0 |
| 2 | [ground beef, yellow onions, diced tomatoes, t... | 269.8 | 22.0 | 32.0 | 48.0 | 39.0 | 27.0 | 5.0 |
| 3 | [spreadable cheese with garlic and herbs, new ... | 368.1 | 17.0 | 10.0 | 2.0 | 14.0 | 8.0 | 20.0 |
| 4 | [tomato juice, apple cider vinegar, sugar, sal... | 352.9 | 1.0 | 337.0 | 23.0 | 3.0 | 0.0 | 28.0 |
# checking calories and protein columns statistics
data[["calories (#)", "protein (PDV)"]].describe()
| calories (#) | protein (PDV) | |
|---|---|---|
| count | 231637.000000 | 231637.00000 |
| mean | 473.942425 | 34.68186 |
| std | 1189.711374 | 58.47248 |
| min | 0.000000 | 0.00000 |
| 25% | 174.400000 | 7.00000 |
| 50% | 313.400000 | 18.00000 |
| 75% | 519.700000 | 51.00000 |
| max | 434360.200000 | 6552.00000 |
The very large maximum value is most likely the effect of inconsistent units: some entries appear to be recorded in kilocalories (kcal), which seems to be the majority based on the statistics, while others are in calories (cal). Therefore, standardizing the units to kilocalories is required to obtain accurate diet categories.
# One kilocalorie is equivalent to 1000 calories.
data["calories (kcal)"] = data["calories (#)"].apply(lambda x: x / 1000 if x > 1000 else x)
As there is a minimum value of 0 for the calories and protein, which is likely due to missing input, records with a 0 value for calories or protein will be dropped to ensure data quality.
data = data[(data["calories (kcal)"] > 0) & (data["protein (PDV)"] > 0)]
Calories visualization to analyze the updated statistics and distribution
calories_fig = px.histogram(
data,
x="calories (kcal)",
title="Distribution of recipe calories intake in kilocalories",
labels={"calories (kcal)": "calories (kcal)", "count": "Count"},
marginal="box",
)
calories_fig.update_layout(bargap=0.2)
calories_fig.show()
The current distribution is more balanced since all the values are now in the same unit, kilocalories.
Labels Identification¶
# create a new column with empty strings
data["diet"] = ""
The calorie threshold is set to 100 based on information about food nutrition values from "Reading Food Nutrition Labels" (Reading Food Nutrition Labels), while the protein threshold is set to 30, as the recommended protein intake per serving is between 15 and 30 (Wempen, 2022).
def categorize_diet(row):
    # Define the thresholds for each nutritional value
    calorie_threshold = 100
    protein_threshold = 30
    # Categorize based on the standardized "calories (kcal)" and "protein (PDV)"
    if row["calories (kcal)"] < calorie_threshold:
        return "Low-Calorie Diet"
    elif row["protein (PDV)"] > protein_threshold:
        return "High-Protein Diet"
    else:
        return "Other Diet"
# Apply the categorize_diet function to populate the "diet" column
data["diet"] = data.apply(categorize_diet, axis=1)
# Create the bar plot
fig = px.histogram(x=data['diet'], labels={'x':'Category'}, title='Categories Distribution')
fig.show()
Most of the recipes fall in the other-diet category, followed by high-protein, with low-calorie being the least frequent.
In order to balance the data, a subset of the dataset is taken with a similar number of records from each category.
# use the minimum diet-category count as the number of samples, capped at 10000
min_count = data['diet'].value_counts().min()
n_samples = min_count if min_count <= 10000 else 10000
# group by 'diet' and take the same number of samples from each group
df = data.groupby('diet', group_keys=False).apply(lambda x: x.sample(min(len(x), n_samples), random_state=42))
# shuffle the rows (and reset the index) so that records of the same
# category are not grouped together
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
# Create the bar plot
fig = px.histogram(x=df['diet'], labels={'x':'Category'}, title='Categories Distribution')
fig.show()
# Instantiate the encoder
le = LabelEncoder()
# Fit and transform the diet labels
df["label"] = le.fit_transform(df['diet'])
# dropping nutrition columns
df.drop(
columns=[
"calories (#)",
"total fat (PDV)",
"sugar (PDV)",
"sodium (PDV)",
"protein (PDV)",
"saturated fat (PDV)",
"carbohydrates (PDV)",
"calories (kcal)",
],
inplace=True,
)
Processed data summary and information¶
# checking the data summary after the processing
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   ingredients  30000 non-null  object
 1   diet         30000 non-null  object
 2   label        30000 non-null  int32
dtypes: int32(1), object(2)
memory usage: 586.1+ KB
The processed data has 30000 records and 3 columns instead of the initial 12, as most of the initial columns were dropped except for ingredients. The other two columns are diet, the categorical label, and label, the encoded label.
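As a note on the encoded label column: LabelEncoder assigns integer codes in alphabetical order of the class names, a behaviour that can be reproduced with plain Python (a sketch of the mapping, not a call to sklearn):

```python
# LabelEncoder's classes_ are the sorted unique labels; each label's
# integer code is its index in that sorted list.
diets = ["Other Diet", "High-Protein Diet", "Low-Calorie Diet", "Other Diet"]
classes = sorted(set(diets))
codes = [classes.index(d) for d in diets]
print(classes)  # ['High-Protein Diet', 'Low-Calorie Diet', 'Other Diet']
print(codes)    # [2, 0, 1, 2]
```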
Processed Data Visualization and Analysis¶
The analysis in this part focuses on the ingredients column as it is the column that the NLP model will depend on.
# flatten all ingredients values into a 1D list
flatten_ingredients = [ingredient for ingredients in df['ingredients'].tolist() for ingredient in ingredients]
# Count the frequency of each ingredient
ingredient_counts = Counter(flatten_ingredients)
# Get the top 30 most common ingredients
top_ingredients = ingredient_counts.most_common(30)
steps, counts = zip(*top_ingredients)
# Create the bar plot
fig = px.bar(x=steps, y=counts, labels={'x':'Ingredients', 'y':'Counts'}, title='Top 30 Ingredient Frequencies')
fig.show()
The previous bar graph shows that salt is the most frequent ingredient in the dataset, followed by butter and sugar.
print(
"There are",
len(ingredient_counts),
"unique ingredients before processing the ingredients text.",
)
There are 8511 unique ingredients before processing the ingredients text.
def plot_cloud(count: Counter, title: str) -> None:
    """
    This function plots a wordcloud
    based on the ingredients counts in the dataframe

    Parameters
    ----------
    count : Counter
        The ingredients frequencies counter
    title : str
        The plot title
    """
    wordcloud = WordCloud(
        width=1000,
        height=500,
        random_state=1,
        background_color="black",
        colormap="Pastel1",
        collocations=False,
        stopwords=STOPWORDS,
    ).generate_from_frequencies(count)
    plt.figure(figsize=(10, 5))
    plt.title(title)
    plt.imshow(wordcloud)
    plt.axis("off")
plot_cloud(ingredient_counts, "Unprocessed Ingredients")
The wordcloud shows that salt, butter, sugar, water, eggs, and onion are the most frequent ingredients as identified previously in the bar graph.
3. Text Preprocessing¶
Based on the analysis of the ingredient values, the text contains stopwords, numbers, special characters, and plurals. Therefore, the main preprocessing techniques required are removal of special characters, numbers, and stopwords, followed by tokenization and lemmatization.
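The combined effect of these steps, except lemmatization, can be sketched with plain Python and `re`; the tiny `STOP` set here is an illustrative stand-in for NLTK's English stopword list, not the list itself:

```python
import re

STOP = {"and", "of", "with"}  # illustrative stand-in for the NLTK stopword list

def preprocess(ingredient: str) -> list[str]:
    # lowercase, replace digits and special characters with spaces,
    # split into tokens, and drop stopwords
    text = re.sub(r"[^a-z ]+", " ", ingredient.lower())
    return [tok for tok in text.split() if tok not in STOP]

print(preprocess("10% Low-Fat Milk"))   # ['low', 'fat', 'milk']
print(preprocess("salt and pepper"))    # ['salt', 'pepper']
```

The notebook applies each step separately below, using NLTK's tokenizer and stopword list instead of these simplifications.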
# create a new column in the dataframe to contain preprocessed ingredients
# and initialize with a copy of ingredients
df["pp ingredients"] = df["ingredients"].copy()
# display the new column for indexes 360 to 375
pd.set_option('display.max_colwidth', None) # to display the column in full width
df.loc[360:375, "pp ingredients"]
360 [king prawns, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic cloves, fresh chives] 361 [wonton wrappers, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt and pepper] 362 [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt and pepper] 363 [old fashioned oats, water, cinnamon, dried fruits, spices] 364 [pork shoulder, olive oil, onion, tomatoes, garlic cloves, cumin powder, fresh oregano, whole cloves, bay leaves, dried chipotle chiles, water, salt and pepper] 365 [ground beef, ricotta cheese, basil pesto, egg, pasta sauce] 366 [bacon, fresh mushrooms, garlic, olive oil, butter, yellow onions, fresh green beans, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crusts, egg, half-and-half] 367 [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg whites, water, chocolate chips] 368 [graham wafer crumbs, almonds, sugar, butter, cream cheese, flour, eggs, sour cream, amaretto di saronno liqueur, apricot jam] 369 [ground round, lean ground turkey, water, diced tomatoes, celery, onion, carrots, beef bouillon cubes, potatoes, mrs. dash seasoning mix, pepper, salt] 370 [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt and pepper] 371 [chicken breasts, sun-dried tomatoes, frozen spinach, goat cheese, pine nuts, shallot, white wine, chicken broth, heavy cream, salt and pepper, oil] 372 [ham, light mayonnaise, walnuts, dijon mustard, curry powder, english cucumber, yellow sweet pepper] 373 [zucchini, green onions, feta cheese, basil leaves, eggs, salt & fresh ground pepper, plain flour, baking powder, vegetable oil] 374 [butter, brown sugar, eggs, baking soda, all-purpose flour, ground cinnamon, ground nutmeg, ground cloves, ground ginger, salt, pecans, candied citron peel] 375 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery] Name: pp ingredients, dtype: object
Lowercasing¶
The first preprocessing technique applied on the text is lowercasing to ensure that all ingredients are in lowercase and to prevent having different cases of the words which can be interpreted as different words by the model.
def lowercase_ingredients(ingredients: list[str]) -> list[str]:
    """
    This function lowercases all the ingredients

    Parameters
    ----------
    ingredients : list[str]
        a list of ingredients to be processed

    Returns
    -------
    list[str]
        a list of lowercased ingredients
    """
    return [ingredient.lower() for ingredient in ingredients]
df["pp ingredients"] = df["pp ingredients"].apply(lowercase_ingredients)
# display the preprocessed ingredients column for indexes 360 to 375
# after applying lowercase
df.loc[360:375, "pp ingredients"]
360 [king prawns, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic cloves, fresh chives] 361 [wonton wrappers, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt and pepper] 362 [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt and pepper] 363 [old fashioned oats, water, cinnamon, dried fruits, spices] 364 [pork shoulder, olive oil, onion, tomatoes, garlic cloves, cumin powder, fresh oregano, whole cloves, bay leaves, dried chipotle chiles, water, salt and pepper] 365 [ground beef, ricotta cheese, basil pesto, egg, pasta sauce] 366 [bacon, fresh mushrooms, garlic, olive oil, butter, yellow onions, fresh green beans, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crusts, egg, half-and-half] 367 [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg whites, water, chocolate chips] 368 [graham wafer crumbs, almonds, sugar, butter, cream cheese, flour, eggs, sour cream, amaretto di saronno liqueur, apricot jam] 369 [ground round, lean ground turkey, water, diced tomatoes, celery, onion, carrots, beef bouillon cubes, potatoes, mrs. dash seasoning mix, pepper, salt] 370 [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt and pepper] 371 [chicken breasts, sun-dried tomatoes, frozen spinach, goat cheese, pine nuts, shallot, white wine, chicken broth, heavy cream, salt and pepper, oil] 372 [ham, light mayonnaise, walnuts, dijon mustard, curry powder, english cucumber, yellow sweet pepper] 373 [zucchini, green onions, feta cheese, basil leaves, eggs, salt & fresh ground pepper, plain flour, baking powder, vegetable oil] 374 [butter, brown sugar, eggs, baking soda, all-purpose flour, ground cinnamon, ground nutmeg, ground cloves, ground ginger, salt, pecans, candied citron peel] 375 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery] Name: pp ingredients, dtype: object
A quick look at the data shows the ingredients were already lowercase, as there is no difference between the previously printed output of this subset and the lowercased output. However, this step guarantees that all ingredients are lowercased.
Special Characters and Numbers Removal¶
This step removes special characters and numbers, as some ingredient names contain special characters, especially the hyphen "-" and percent sign "%", which appear frequently in names like "10% low-fat milk", where the same ingredient might also appear as "low fat milk" in another recipe.
def clean_text(ingredients: list[str]) -> list[str]:
    """
    This function removes special characters and numbers from the ingredients

    Parameters
    ----------
    ingredients : list[str]
        a list of ingredients to be processed

    Returns
    -------
    list[str]
        a list of ingredients after special characters and numbers removal
    """
    return [re.sub("[^A-Za-z ]+", " ", ingredient) for ingredient in ingredients]
df["pp ingredients"] = df["pp ingredients"].apply(clean_text)
# display the preprocessed ingredients column for indexes 360 to 375
# after applying special characters and numbers removal
df.loc[360:375, "pp ingredients"]
360 [king prawns, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic cloves, fresh chives] 361 [wonton wrappers, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt and pepper] 362 [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt and pepper] 363 [old fashioned oats, water, cinnamon, dried fruits, spices] 364 [pork shoulder, olive oil, onion, tomatoes, garlic cloves, cumin powder, fresh oregano, whole cloves, bay leaves, dried chipotle chiles, water, salt and pepper] 365 [ground beef, ricotta cheese, basil pesto, egg, pasta sauce] 366 [bacon, fresh mushrooms, garlic, olive oil, butter, yellow onions, fresh green beans, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crusts, egg, half and half] 367 [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg whites, water, chocolate chips] 368 [graham wafer crumbs, almonds, sugar, butter, cream cheese, flour, eggs, sour cream, amaretto di saronno liqueur, apricot jam] 369 [ground round, lean ground turkey, water, diced tomatoes, celery, onion, carrots, beef bouillon cubes, potatoes, mrs dash seasoning mix, pepper, salt] 370 [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt and pepper] 371 [chicken breasts, sun dried tomatoes, frozen spinach, goat cheese, pine nuts, shallot, white wine, chicken broth, heavy cream, salt and pepper, oil] 372 [ham, light mayonnaise, walnuts, dijon mustard, curry powder, english cucumber, yellow sweet pepper] 373 [zucchini, green onions, feta cheese, basil leaves, eggs, salt fresh ground pepper, plain flour, baking powder, vegetable oil] 374 [butter, brown sugar, eggs, baking soda, all purpose flour, ground cinnamon, ground nutmeg, ground cloves, ground ginger, salt, pecans, candied citron peel] 375 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery] Name: pp ingredients, dtype: object
It can be seen that the hyphens, percent signs, and numbers previously present in the data have been removed.
Tokenization¶
The third preprocessing technique is tokenization, where each ingredient is split into separate words instead of phrases.
# word_tokenize from NLTK is used below (requires the "punkt" tokenizer data)
from nltk.tokenize import word_tokenize

def tokenize_ingredients(ingredients: list[str]) -> list[list[str]]:
    """
    This function tokenizes each ingredient in the ingredients list
    Parameters
    ----------
    ingredients : list[str]
        a list of ingredients to be processed
    Returns
    -------
    list[list[str]]
        a list of tokenized ingredients as lists of tokens
    """
    ingredients = [word_tokenize(ingredient) for ingredient in ingredients]
    return ingredients
df["pp ingredients"] = df["pp ingredients"].apply(tokenize_ingredients)
# display the preprocessed ingredients column for indexes 360 to 375
# after applying tokenization
df.loc[360:375, "pp ingredients"]
360 [[king, prawns], [fresh, parsley, sprig], [olive, oil], [barbecue, sauce], [lemon, juice], [honey], [garlic, cloves], [fresh, chives]] 361 [[wonton, wrappers], [ricotta, cheese], [frozen, chopped, spinach], [egg, yolk], [parmesan, cheese], [salt, and, pepper]] 362 [[carrot], [olive, oil], [balsamic, vinegar], [honey], [chili, powder], [cumin], [ginger], [salt, and, pepper]] 363 [[old, fashioned, oats], [water], [cinnamon], [dried, fruits], [spices]] 364 [[pork, shoulder], [olive, oil], [onion], [tomatoes], [garlic, cloves], [cumin, powder], [fresh, oregano], [whole, cloves], [bay, leaves], [dried, chipotle, chiles], [water], [salt, and, pepper]] 365 [[ground, beef], [ricotta, cheese], [basil, pesto], [egg], [pasta, sauce]] 366 [[bacon], [fresh, mushrooms], [garlic], [olive, oil], [butter], [yellow, onions], [fresh, green, beans], [fresh, thyme], [ground, cumin], [green, onion], [salt], [pepper], [cream, cheese], [whole, milk], [pie, crusts], [egg], [half, and, half]] 367 [[flour], [baking, soda], [salt], [dark, brown, sugar], [sugar], [margarine], [vanilla], [egg, whites], [water], [chocolate, chips]] 368 [[graham, wafer, crumbs], [almonds], [sugar], [butter], [cream, cheese], [flour], [eggs], [sour, cream], [amaretto, di, saronno, liqueur], [apricot, jam]] 369 [[ground, round], [lean, ground, turkey], [water], [diced, tomatoes], [celery], [onion], [carrots], [beef, bouillon, cubes], [potatoes], [mrs, dash, seasoning, mix], [pepper], [salt]] 370 [[fennel, bulb], [olive, oil], [balsamic, vinegar], [dijon, mustard], [garlic], [salt, and, pepper]] 371 [[chicken, breasts], [sun, dried, tomatoes], [frozen, spinach], [goat, cheese], [pine, nuts], [shallot], [white, wine], [chicken, broth], [heavy, cream], [salt, and, pepper], [oil]] 372 [[ham], [light, mayonnaise], [walnuts], [dijon, mustard], [curry, powder], [english, cucumber], [yellow, sweet, pepper]] 373 [[zucchini], [green, onions], [feta, cheese], [basil, leaves], [eggs], [salt, fresh, ground, pepper], [plain, 
flour], [baking, powder], [vegetable, oil]] 374 [[butter], [brown, sugar], [eggs], [baking, soda], [all, purpose, flour], [ground, cinnamon], [ground, nutmeg], [ground, cloves], [ground, ginger], [salt], [pecans], [candied, citron, peel]] 375 [[ketchup], [brown, sugar], [yellow, mustard], [worcestershire, sauce], [onion], [bell, pepper], [celery]] Name: pp ingredients, dtype: object
Each recipe's ingredients are now a list of lists: one inner list of word tokens per ingredient. Extra spaces are discarded as well.
Stopwords Removal¶
As for stopwords removal, it strips conjunctions and other common words that may appear in ingredient names. This helps generalize the ingredients: for example, "cookies and cream ice cream" and "cookies cream ice cream" become the same ingredient once the stopword "and" is removed.
# the stopwords corpus from NLTK is used below (requires the "stopwords" data)
from nltk.corpus import stopwords

def remove_stopwords(ingredients: list[list[str]]) -> list[list[str]]:
    """
    This function removes stopwords from the ingredient lists.
    Parameters
    ----------
    ingredients : list[list[str]]
        a list of tokenized ingredients as lists of tokens
    Returns
    -------
    list[list[str]]
        a list of tokenized ingredients as lists of tokens without stopwords
    """
    stop_words = set(stopwords.words("english"))
    ingredients = [
        [word for word in ingredient if word not in stop_words]
        for ingredient in ingredients
    ]
    return ingredients
df["pp ingredients"] = df["pp ingredients"].apply(remove_stopwords)
# display the preprocessed ingredients column for indexes 360 to 375
# after applying stopwords removal
df.loc[360:375, "pp ingredients"]
360 [[king, prawns], [fresh, parsley, sprig], [olive, oil], [barbecue, sauce], [lemon, juice], [honey], [garlic, cloves], [fresh, chives]] 361 [[wonton, wrappers], [ricotta, cheese], [frozen, chopped, spinach], [egg, yolk], [parmesan, cheese], [salt, pepper]] 362 [[carrot], [olive, oil], [balsamic, vinegar], [honey], [chili, powder], [cumin], [ginger], [salt, pepper]] 363 [[old, fashioned, oats], [water], [cinnamon], [dried, fruits], [spices]] 364 [[pork, shoulder], [olive, oil], [onion], [tomatoes], [garlic, cloves], [cumin, powder], [fresh, oregano], [whole, cloves], [bay, leaves], [dried, chipotle, chiles], [water], [salt, pepper]] 365 [[ground, beef], [ricotta, cheese], [basil, pesto], [egg], [pasta, sauce]] 366 [[bacon], [fresh, mushrooms], [garlic], [olive, oil], [butter], [yellow, onions], [fresh, green, beans], [fresh, thyme], [ground, cumin], [green, onion], [salt], [pepper], [cream, cheese], [whole, milk], [pie, crusts], [egg], [half, half]] 367 [[flour], [baking, soda], [salt], [dark, brown, sugar], [sugar], [margarine], [vanilla], [egg, whites], [water], [chocolate, chips]] 368 [[graham, wafer, crumbs], [almonds], [sugar], [butter], [cream, cheese], [flour], [eggs], [sour, cream], [amaretto, di, saronno, liqueur], [apricot, jam]] 369 [[ground, round], [lean, ground, turkey], [water], [diced, tomatoes], [celery], [onion], [carrots], [beef, bouillon, cubes], [potatoes], [mrs, dash, seasoning, mix], [pepper], [salt]] 370 [[fennel, bulb], [olive, oil], [balsamic, vinegar], [dijon, mustard], [garlic], [salt, pepper]] 371 [[chicken, breasts], [sun, dried, tomatoes], [frozen, spinach], [goat, cheese], [pine, nuts], [shallot], [white, wine], [chicken, broth], [heavy, cream], [salt, pepper], [oil]] 372 [[ham], [light, mayonnaise], [walnuts], [dijon, mustard], [curry, powder], [english, cucumber], [yellow, sweet, pepper]] 373 [[zucchini], [green, onions], [feta, cheese], [basil, leaves], [eggs], [salt, fresh, ground, pepper], [plain, flour], [baking, powder], 
[vegetable, oil]] 374 [[butter], [brown, sugar], [eggs], [baking, soda], [purpose, flour], [ground, cinnamon], [ground, nutmeg], [ground, cloves], [ground, ginger], [salt], [pecans], [candied, citron, peel]] 375 [[ketchup], [brown, sugar], [yellow, mustard], [worcestershire, sauce], [onion], [bell, pepper], [celery]] Name: pp ingredients, dtype: object
The data is now free of stopwords, as all stopwords defined in the NLTK stopwords corpus have been removed.
Lemmatization¶
The last preprocessing technique is lemmatization, a text normalization technique used mainly to simplify the text and reduce plural words to their singular form. Lemmatization was chosen over stemming because it is generally more accurate and produces actual dictionary words, whereas stemming may produce tokens that are not meaningful words (Balodi, 2020).
# WordNetLemmatizer from NLTK is used below (requires the "wordnet" data)
from nltk.stem import WordNetLemmatizer

def lemmatize_ingredients(ingredients: list[list[str]]) -> list[str]:
    """
    This function applies lemmatization on the ingredients
    and returns the ingredients as phrases instead of word tokens
    Parameters
    ----------
    ingredients : list[list[str]]
        a list of tokenized ingredients as lists of tokens
    Returns
    -------
    list[str]
        a list of lemmatized and processed ingredients
    """
    lem = WordNetLemmatizer()
    ingredients = [
        [lem.lemmatize(word) for word in ingredient] for ingredient in ingredients
    ]
    ingredients = [" ".join(ingredient) for ingredient in ingredients]
    return ingredients
df["pp ingredients"] = df["pp ingredients"].apply(lemmatize_ingredients)
# display the preprocessed ingredients column for indexes 360 to 375
# after applying lemmatization
df.loc[360:375, "pp ingredients"]
360 [king prawn, fresh parsley sprig, olive oil, barbecue sauce, lemon juice, honey, garlic clove, fresh chive] 361 [wonton wrapper, ricotta cheese, frozen chopped spinach, egg yolk, parmesan cheese, salt pepper] 362 [carrot, olive oil, balsamic vinegar, honey, chili powder, cumin, ginger, salt pepper] 363 [old fashioned oat, water, cinnamon, dried fruit, spice] 364 [pork shoulder, olive oil, onion, tomato, garlic clove, cumin powder, fresh oregano, whole clove, bay leaf, dried chipotle chile, water, salt pepper] 365 [ground beef, ricotta cheese, basil pesto, egg, pasta sauce] 366 [bacon, fresh mushroom, garlic, olive oil, butter, yellow onion, fresh green bean, fresh thyme, ground cumin, green onion, salt, pepper, cream cheese, whole milk, pie crust, egg, half half] 367 [flour, baking soda, salt, dark brown sugar, sugar, margarine, vanilla, egg white, water, chocolate chip] 368 [graham wafer crumb, almond, sugar, butter, cream cheese, flour, egg, sour cream, amaretto di saronno liqueur, apricot jam] 369 [ground round, lean ground turkey, water, diced tomato, celery, onion, carrot, beef bouillon cube, potato, mr dash seasoning mix, pepper, salt] 370 [fennel bulb, olive oil, balsamic vinegar, dijon mustard, garlic, salt pepper] 371 [chicken breast, sun dried tomato, frozen spinach, goat cheese, pine nut, shallot, white wine, chicken broth, heavy cream, salt pepper, oil] 372 [ham, light mayonnaise, walnut, dijon mustard, curry powder, english cucumber, yellow sweet pepper] 373 [zucchini, green onion, feta cheese, basil leaf, egg, salt fresh ground pepper, plain flour, baking powder, vegetable oil] 374 [butter, brown sugar, egg, baking soda, purpose flour, ground cinnamon, ground nutmeg, ground clove, ground ginger, salt, pecan, candied citron peel] 375 [ketchup, brown sugar, yellow mustard, worcestershire sauce, onion, bell pepper, celery] Name: pp ingredients, dtype: object
After lemmatization, plural ingredient words have been reduced to their singular forms.
Preprocessed Text Visualization¶
# flatten all ingredients values into a 1D list
flatten_ingredients = [ingredient for ingredients in df['ingredients'].tolist() for ingredient in ingredients]
flatten_pp_ingredients = [ingredient for ingredients in df['pp ingredients'].tolist() for ingredient in ingredients]
# Count the frequency of the ingredients
ingredient_counts = Counter(flatten_ingredients)
ingredient_pp_counts = Counter(flatten_pp_ingredients)
# Get the top 30 most common ingredients
top_ingredients = ingredient_counts.most_common(30)
steps, counts = zip(*top_ingredients)
top_pp_ingredients = ingredient_pp_counts.most_common(30)
pp_ingredients, pp_counts = zip(*top_pp_ingredients)
fig = sp.make_subplots(rows=2, cols=1, vertical_spacing=0.2)
fig.add_trace(
go.Bar(x=steps, y=counts, name='Original Ingredients'),
row=1, col=1
)
fig.add_trace(
go.Bar(x=pp_ingredients, y=pp_counts, name='Processed Ingredients'),
row=2, col=1
)
# Create the bar plot
fig.update_layout(title_text="Top 30 Ingredient Frequencies Before and After Processing", height=500)
fig.show()
The bar plots indicate that salt, butter, and sugar are the top three most frequent ingredients both before and after preprocessing, with unchanged frequencies. Egg, however, moved from fifth place before preprocessing to fourth place after, most likely because some recipes listed it as "egg" and others as "eggs", and the preprocessing unified the two into the single ingredient "egg". Fifth place was taken by onion, which moved up from sixth.
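The rank change can be reproduced in miniature: once plural forms are normalized, counts that were previously split across variants collapse into one. A toy sketch with made-up counts (not the project data):

```python
from collections import Counter

# hypothetical raw ingredient mentions before preprocessing
raw = ["egg", "eggs", "egg", "eggs", "onion"]

# lemmatization maps plural forms to the singular, e.g. "eggs" -> "egg"
singular = {"eggs": "egg", "onions": "onion"}
normalized = [singular.get(w, w) for w in raw]

print(Counter(raw))         # split counts: egg=2, eggs=2, onion=1
print(Counter(normalized))  # unified counts: egg=4, onion=1
```

The same effect across thousands of recipes explains both the rank shifts and the drop in vocabulary size reported below.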
print(
"Number of unique ingredients before preprocessing was: ",
len(ingredient_counts),
"\nNumber of unique ingredients after preprocessing is: ",
len(ingredient_pp_counts),
"\nThis shows that the preprocessing unified a lot of similar ingredients that were previously written with slight changes\nand narrowed the number of different ingredients by",
len(ingredient_counts) - len(ingredient_pp_counts),
"decreasing the dimensionality of the data,",
"\nwhich will help in improving the performance of the ML models"
)
Number of unique ingredients before preprocessing was: 8511 Number of unique ingredients after preprocessing is: 7717 This shows that the preprocessing unified a lot of similar ingredients that were previously written with slight changes and narrowed the number of different ingredients by 794 decreasing the dimensionality of the data, which will help in improving the performance of the ML models
4. Text Representation¶
Three different text representation techniques are applied to the preprocessed ingredients to convert them into vectors in preparation for developing the classification models.
Bag of Words (BoW)¶
The first technique is bag of words (BoW), which represents the ingredients as vectors of word frequencies.
# CountVectorizer from scikit-learn builds the BoW document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

c_vectorizer = CountVectorizer()
bow_ing = c_vectorizer.fit_transform(df["pp ingredients"].apply(' '.join))
# calculate the term frequency for each term
sum_words = bow_ing.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in c_vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
# create a DataFrame with only the top 30 terms in terms of frequency
bow_top_30 = words_freq[:30]
bow_top_30_df = pd.DataFrame(bow_top_30, columns=['Term', 'Frequency'])
Term Frequency - Inverse Document Frequency (TF-IDF)¶
The second technique is term frequency - inverse document frequency (TF-IDF), an extension of bag of words that down-weights terms appearing in many recipes, so a term's score reflects its importance to a particular recipe rather than its raw frequency across the whole corpus.
# initialize TfidfVectorizer (from scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer

t_vectorizer = TfidfVectorizer()
# learn the 'vocabulary' of the documents and transform the documents into a document-term matrix
tfidf_ing = t_vectorizer.fit_transform(df["pp ingredients"].apply(' '.join))
# calculate the term frequency for each term
sum_words = tfidf_ing.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in t_vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
# create a DataFrame with only the top 30 terms in terms of frequency
tfidf_top_30 = words_freq[:30]
tfidf_top_30_df = pd.DataFrame(tfidf_top_30, columns=['Term', 'TF-IDF score'])
# create bar plots for top 30 frequencies of BoW and TF-IDF
fig = sp.make_subplots(rows=2, cols=1)
fig.add_trace(
go.Bar(x=bow_top_30_df['Term'], y=bow_top_30_df['Frequency'], name='BoW'),
row=1, col=1
)
fig.add_trace(
go.Bar(x=tfidf_top_30_df['Term'], y=tfidf_top_30_df['TF-IDF score'], name='TF-IDF'),
row=2, col=1
)
# Create the bar plot
fig.update_layout(title_text="Top 30 Words Frequencies with BoW and TF-IDF", height=500)
fig.show()
The bar plots indicate that some words have a high frequency in the ingredients data. However, when their TF-IDF importance is compared with their raw frequency, some of them, such as 'salt' and 'pepper', turn out to be less important, while others, like 'butter', gain importance.
Word Embeddings with Word2Vec¶
The third technique is word embedding using word2vec, which creates embeddings for the ingredients that capture semantic relationships between words.
# Train a Word2Vec model (from gensim); note that "pp ingredients" contains
# lists of ingredient phrases, so each whole phrase is treated as one token
from gensim.models import Word2Vec

w2v_ing = Word2Vec(df["pp ingredients"], min_count=1)
# Get the 30 tokens most similar to "chocolate"
similar_words = w2v_ing.wv.most_similar("chocolate", topn=30)
words = [word for word, _ in similar_words]
similarities = [similarity for _, similarity in similar_words]
vectors = w2v_ing.wv[words]
# Perform PCA (from scikit-learn) to project the vectors into 2D
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
result = pca.fit_transform(vectors)
# Create a DataFrame with the PCA results
df2 = pd.DataFrame(result, columns=["PC1", "PC2"])
df2["word"] = words
df2["similarity"] = similarities
# Create a scatter plot using Plotly Express
fig = px.scatter(
df2,
x="PC1",
y="PC2",
text="word",
color="similarity",
hover_data=["word", "similarity"],
title="Top 30 words similar to the word 'Chocolate'",
)
# Update layout properties
fig.update_traces(textposition="bottom right")
# Show the plot
fig.show()
def embed_ingredients(ingredients):
    """Look up the Word2Vec vector of each ingredient phrase in a recipe."""
    # get a list of the recipe's ingredient vectors
    ingredients_vec = [
        w2v_ing.wv[ingredient]
        for ingredient in ingredients
        if ingredient in w2v_ing.wv
    ]
    # if the recipe has no ingredients in the model's vocabulary,
    # return a single zero vector as a placeholder
    if len(ingredients_vec) == 0:
        return np.zeros(w2v_ing.vector_size)
    # otherwise, return the list of per-ingredient vectors
    return ingredients_vec
# insert the embeddings in a new column in the dataframe
df['ing_embeddings'] = df['pp ingredients'].apply(embed_ingredients)
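One common way to turn the variable-length vector lists produced above into fixed-length classifier inputs is mean pooling: averaging a recipe's ingredient vectors into a single vector. A minimal numpy sketch with made-up 4-dimensional vectors (in the project, w2v_ing.wv vectors and w2v_ing.vector_size would be used instead):

```python
import numpy as np

VECTOR_SIZE = 4  # stand-in for the real model's vector_size

# hypothetical per-ingredient embeddings for one recipe
ingredient_vectors = [
    np.array([0.1, 0.2, 0.0, 0.4]),
    np.array([0.3, 0.0, 0.2, 0.0]),
]

def mean_pool(vectors, size=VECTOR_SIZE):
    """Average a recipe's ingredient vectors into one fixed-length vector."""
    if not vectors:
        # fall back to a zero vector for recipes with no known ingredients
        return np.zeros(size)
    return np.mean(vectors, axis=0)

recipe_vec = mean_pool(ingredient_vectors)
print(recipe_vec.shape)  # (4,) regardless of how many ingredients a recipe has
```

Mean pooling keeps every recipe's representation the same length, which is what scikit-learn estimators expect as input.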
5. Text Classification / Prediction¶
The first step before building the classification models is to identify the input and target, then split the data into train and test sets. The input (ingredients) is fed to the models as Bag of Words or TF-IDF vectors. Note that both splits below use the same test_size and random_state, so the train/test partition of the rows, and therefore y_train and y_test, is identical for both representations.
# target
y = df["label"]
Text Classification with Bag of Words vectors¶
# input with BoW vectorization
from sklearn.model_selection import train_test_split

X_bow = bow_ing
X_train_bow, X_test_bow, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)
Random Forest Classification¶
from sklearn.ensemble import RandomForestClassifier

rfc_bow = RandomForestClassifier(random_state=42)
rfc_bow.fit(X_train_bow, y_train)
# Make predictions
y_pred_rfc_bow = rfc_bow.predict(X_test_bow)
Support Vector Machine Classification¶
from sklearn.svm import SVC

svc_bow = SVC()
svc_bow.fit(X_train_bow, y_train)
# Make predictions
y_pred_svc_bow = svc_bow.predict(X_test_bow)
K Nearest Neighbors Classification¶
from sklearn.neighbors import KNeighborsClassifier

knn_bow = KNeighborsClassifier(n_neighbors=15)
knn_bow.fit(X_train_bow, y_train)
# Make predictions
y_pred_knn_bow = knn_bow.predict(X_test_bow)
Text Classification with TF-IDF vectors¶
# input with TF-IDF vectorization
X_tfidf = tfidf_ing
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
Random Forest Classification¶
rfc_tfidf = RandomForestClassifier(random_state=42)
rfc_tfidf.fit(X_train_tfidf, y_train)
# Make predictions
y_pred_rfc_tfidf = rfc_tfidf.predict(X_test_tfidf)
Support Vector Machine Classification¶
svc_tfidf = SVC()
svc_tfidf.fit(X_train_tfidf, y_train)
# Make predictions
y_pred_svc_tfidf = svc_tfidf.predict(X_test_tfidf)
K Nearest Neighbors Classification¶
knn_tfidf = KNeighborsClassifier(n_neighbors=15)
knn_tfidf.fit(X_train_tfidf, y_train)
# Make predictions
y_pred_knn_tfidf = knn_tfidf.predict(X_test_tfidf)
6. Evaluation, Inferences, Recommendations and Reflection¶
Evaluation¶
# Display classification reports for each model
from sklearn.metrics import classification_report

print("Bag of Words Vectorizer")
print("_"*100)
print("\nRandom Forest Classifier:")
print(classification_report(y_test, y_pred_rfc_bow))
print("\nSupport Vector Machine Classifier:")
print(classification_report(y_test, y_pred_svc_bow))
print("K-Nearest Neighbors Classifier:")
print(classification_report(y_test, y_pred_knn_bow))
print("="*100)
print("TF-IDF Vectorizer")
print("_"*100)
print("\nRandom Forest Classifier:")
print(classification_report(y_test, y_pred_rfc_tfidf))
print("\nSupport Vector Machine Classifier:")
print(classification_report(y_test, y_pred_svc_tfidf))
print("K-Nearest Neighbors Classifier:")
print(classification_report(y_test, y_pred_knn_tfidf))
Bag of Words Vectorizer
____________________________________________________________________________________________________
Random Forest Classifier:
precision recall f1-score support
0 0.69 0.77 0.73 2009
1 0.62 0.63 0.62 2023
2 0.60 0.52 0.56 1968
accuracy 0.64 6000
macro avg 0.64 0.64 0.64 6000
weighted avg 0.64 0.64 0.64 6000
Support Vector Machine Classifier:
precision recall f1-score support
0 0.73 0.76 0.74 2009
1 0.64 0.65 0.65 2023
2 0.61 0.57 0.59 1968
accuracy 0.66 6000
macro avg 0.66 0.66 0.66 6000
weighted avg 0.66 0.66 0.66 6000
K-Nearest Neighbors Classifier:
precision recall f1-score support
0 0.74 0.37 0.49 2009
1 0.45 0.77 0.57 2023
2 0.54 0.41 0.47 1968
accuracy 0.52 6000
macro avg 0.58 0.52 0.51 6000
weighted avg 0.58 0.52 0.51 6000
====================================================================================================
TF-IDF Vectorizer
____________________________________________________________________________________________________
Random Forest Classifier:
precision recall f1-score support
0 0.67 0.79 0.72 2009
1 0.63 0.61 0.62 2023
2 0.60 0.51 0.55 1968
accuracy 0.64 6000
macro avg 0.63 0.63 0.63 6000
weighted avg 0.63 0.64 0.63 6000
Support Vector Machine Classifier:
precision recall f1-score support
0 0.74 0.76 0.75 2009
1 0.65 0.65 0.65 2023
2 0.61 0.59 0.60 1968
accuracy 0.67 6000
macro avg 0.67 0.67 0.67 6000
weighted avg 0.67 0.67 0.67 6000
K-Nearest Neighbors Classifier:
precision recall f1-score support
0 0.69 0.68 0.68 2009
1 0.57 0.62 0.59 2023
2 0.56 0.50 0.53 1968
accuracy 0.60 6000
macro avg 0.60 0.60 0.60 6000
weighted avg 0.60 0.60 0.60 6000
# plot normalized confusion matrices with the scikit-plot library
import scikitplot as skplt

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
# Row 1
skplt.metrics.plot_confusion_matrix(
y_test, y_pred_rfc_bow,
normalize=True,
title="Random Forest with Bag of Words",
cmap="Blues",
ax=axes[0, 0]
)
skplt.metrics.plot_confusion_matrix(
y_test, y_pred_svc_bow,
normalize=True,
title="Support Vector Machine with Bag of Words",
cmap="Blues",
ax=axes[0, 1]
)
skplt.metrics.plot_confusion_matrix(
y_test, y_pred_knn_bow,
normalize=True,
title="K Nearest Neighbors with Bag of Words",
cmap="Blues",
ax=axes[0, 2]
)
# Row 2
skplt.metrics.plot_confusion_matrix(
y_test, y_pred_rfc_tfidf,
normalize=True,
title="Random Forest with TF-IDF",
cmap="Purples",
ax=axes[1, 0]
)
skplt.metrics.plot_confusion_matrix(
y_test, y_pred_svc_tfidf,
normalize=True,
title="Support Vector Machine with TF-IDF",
cmap="Purples",
ax=axes[1, 1]
)
skplt.metrics.plot_confusion_matrix(
y_test, y_pred_knn_tfidf,
normalize=True,
title="K Nearest Neighbors with TF-IDF",
cmap="Purples",
ax=axes[1, 2]
)
# Adjust layout for better appearance
plt.tight_layout()
# Show the plot
plt.show()
The Labels:
- 0 -> High Protein Diet
- 1 -> Low Calorie Diet
- 2 -> Other Diet
Inferences¶
Based on the evaluation of the six models, both the classification reports and the confusion matrices show that the support vector machine achieves the highest accuracy: 67% with TF-IDF vectors, followed by 66% with Bag of Words. Random Forest comes next at 64% with both Bag of Words and TF-IDF vectors.
Multiple deep learning models, including LSTM and RNN, were also tested but gave poor results, as did other machine learning models such as logistic regression and Naive Bayes. The same models (Random Forest, SVC, KNN) were additionally tested with word2vec embeddings but achieved lower accuracy than their Bag of Words and TF-IDF counterparts. This might be because word2vec embeddings capture more complex structure that is better exploited by deep learning models.
These results suggest that the relationship between the ingredients alone and the calorie and protein content may not be strong. It is possible that model performance could be improved by including additional textual information related to the diet.
Another possibility is that the chosen models are not well suited to the task and that other models could have performed better.
Recommendations and Reflection¶
As recommendations for further development of this project, it would be worthwhile to experiment more with the dataset and try categories and labels other than diet, such as cuisine. Applying neural networks might also yield more accurate results, but likely requires larger datasets: with the current dataset, both RNN and LSTM models were tested, with TensorFlow tokens as well as word2vec embeddings, but did not perform well. In addition, the dataset could be used for a text generation model that takes the ingredients as input and generates the preparation steps for a recipe, which is an interesting use case.
As for reflection on the project, the results of the developed models could have been further improved by experimenting with more classification models and with model hyperparameters. Cross-validation could also have been useful for obtaining more reliable performance estimates.
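As a sketch of how cross-validation could be wired in, the snippet below runs 5-fold cross-validation on synthetic stand-in data; in the project, X would be tfidf_ing and y the df["label"] column.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# synthetic stand-in for the TF-IDF feature matrix and the 3 diet labels
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=42)

# 5-fold cross-validation: each fold serves once as the held-out set
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores.mean())  # mean accuracy across the 5 folds
```

Averaging over folds reduces the dependence of the reported accuracy on one particular 80/20 split, which is the main weakness of the single train_test_split used above.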
Log files¶
Github Link¶
References¶
Balodi, T. (2020, July 14). What is Stemming and Lemmatization in NLP? Retrieved from analyticssteps: https://www.analyticssteps.com/blogs/what-stemming-and-lemmatization-nlp
Li, S. (2019). Food.com Recipes and Interactions, [Data set]. Kaggle. doi: https://doi.org/10.34740/KAGGLE/DSV/783630
Reading Food Nutrition Labels. (n.d.). Retrieved from Washington State Department of Social and Health Services: https://www.dshs.wa.gov/sites/default/files/ALTSA/stakeholders/documents/duals/toolkit/Reading%20Food%20Nutrition%20Labels.pdf
Wempen, K. (2022, April 29). Are you getting too much protein? Retrieved from mayo clinic health system: https://www.mayoclinichealthsystem.org/hometown-health/speaking-of-health/are-you-getting-too-much-protein#:~:text=General%20recommendations%20are%20to%20consume,30%20grams%20at%20one%20time.